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Abstract —The computational cost of many signal processing 
and machine learning techniques is often dominated by the 
cost of applying certain linear operators to high-dimensional 
vectors. This paper introduces an algorithm aimed at reducing 
the complexity of applying linear operators in high dimension 
by approximately factorizing the corresponding matrix into 
few sparse factors. The approach relies on recent advances 
in non-convex optimization. It is first explained and analyzed 
in details and then demonstrated experimentally on various 
problems including dictionary learning for image denoising, and 
the approximation of large matrices arising in inverse problems. 


Index Terms —Sparse representations, fast algorithms, dictio¬ 
nary learning, low complexity, image denoising, inverse problems. 


I. Introduction 

S parsity has been at the heart of a plethora of signal 
processing and data analysis techniques over the last two 
decades. These techniques usually impose that the objects 
of interest be sparse in a certain domain. They owe their 
success to the fact that sparse objects are easier to manipulate 
and more prone to interpretation than dense ones especially 
in high dimension. However, to efficiently manipulate high¬ 
dimensional data, it is not sufficient to rely on sparse objects: 
efficient operators are also needed to manipulate these objects. 

The n-dimensional Discrete Fourier Transform (DFT) is 
certainly the most well known linear operator with an efficient 
implementation: the Fast Fourier Transform (FFT) Q, allows 
to apply the operator in G(nlogn) arithmetic operations 
instead of 0 {rF) in its dense form. Similar complexity savings 
have been achieved for other widely used operators such as 
the Hadamard transform 0, the Discrete Cosine Transform 
(DCT) Q or the Discrete Wavelet Transform (DWT) |j^. For 
all these fast linear transforms, the matrix A corresponding 
to the dense form of the operator admits a multi-layer sparse 
expression, 

.7 

A=ns,. (1) 

i=i 

corresponding to a multi-layer factorizatiorQinto a small num¬ 
ber J of sparse factors Sj. Following the definition of a linear 
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*The product being taken from right to left: 0^=1 ® J • • • Si 


algorithm given in |[7j, this multi-layer sparse factorization is 
actually the natural representation of any fast linear transform. 
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Fig. 1. The Hadamard matrix of size n X n with n = 32 (left) and 
its factorization. The matrix is totally dense so that the naive storage and 
multiplication cost 0{n^ = 1024). On the other hand, we show the 
factorization of the matrix into log 2 (n) = 5 factors, each having 2n = 64 
non-zero entries, so that the storage and multiplication in the factorized form 
cost 0{2nlog2(n) = 320). 

For example each step of the butterfly radix -2 FFT can be 
seen as the multiplication by a sparse matrix having only two 
non-zero entries per row and per column. This fact is further 
illustrated in the case of the Hadamard transform on Figure 
For other examples, see e.g. |[T] Appendix A]. 

Inspired by these widely used transforms, our objective is 
to find approximations of operators of interest encountered 
in concrete applications, as products of sparse matrices as in 
0 - Such approximations will be called Flexible Approximate 
MUlti-layer Sparse Transforms (FApST). 

As a primary example of potential application of such 
approximations, consider linear inverse problems, where data 
and model parameters are linked through a linear operator. 
State of the art algorithms addressing such problems with 
sparse regularization m-m are known to heavily rely on 
matrix-vector products involving both this operator and its 
adjoint. As illustrated in Section [V] on a biomedical inverse 
problem, replacing the operator by an accurate FA/iST has 
the potential to substantially accelerate these methods. 

To choose a regularizer for inverse problems, dictionary 
learning is a common method used to learn the domain in 
which some training data admits a sparse representation un¬ 
its applicability is however also somewhat limited by the need 
to compute many matrix-vector products involving the learned 
dictionary and its adjoint, which are in general dense matrices. 
We will see that recent approaches to learn fast dictionaries 
ITg,® can be seen as special cases of the FAp,ST dictionary 
learning approach developed in Section [Vlj where the learned 
dictionaries are constrained to be FA/rST. 

Beyond the above considered examples, any task where 
it is required to apply a linear operator in high dimension 
would obviously benefit from a FA/iST corresponding to the 
considered operator. For example, in the emerging area of 
signal processing on graphs P6[ , novel definitions of usual 
operators such as the Fourier or wavelet transforms have been 
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introduced. They have no known general sparse forms, and 
consequently no associated fast algorithms. Finding multi¬ 
layer sparse approximations of these usual operators on graphs 
would certainly boost the dissemination and impact of graph 
signal processing techniques. 

Objective. The quest for multi-layer sparse approximations of 
large linear operators, which is the core objective of this paper, 
actually amounts to a matrix factorization problem, where the 
matrix A corresponding to the dense form of the operator is 
to be decomposed into the product of sparse factors Sj, so as 
to satisfy an approximate form of Q- 
Contributions. This paper substantially extends the prelimi¬ 
nary work started in |[T|, Q and p7) , both on the theoretical 
and experimental sides, with the following contributions: 

• A general framework for multi-layer sparse approxima¬ 
tion (MSA) is introduced, that allows to incorporate 
various constraints on the sought sparse form; 

• Recent advances in non-convex optimization m are 
exploited to tackle the resulting non-convex optimization 
problem with local convergence guarantees; 

• A heuristic hierarchical factorization algorithm leveraging 
these optimization techniques is proposed, that achieves 
factorizations empirically stable to initialization; 

• The versatility of the framework is illustrated with ex¬ 
tensive experiments on two showcase applications, linear 
inverse problems and dictionary learning, demonstrating 
its practical benefits. 

The remaining of the paper is organized as follows. The 
problem is formulated, related to prior art and the expected 
benefits of FA/rSTs are systematically explained in Section [H] 
A general optimization framework for the induced matrix fac¬ 


torization problem is introduced in Section 111 and Section IV 


and as a first illustration we demonstrate that it is possible to 
reverse-engineer the Hadamard transform. Several applications 
and experiments on various tasks, illustrating the versatility of 
the proposed approach are performed in sections |V] and VI 


II. Problem formulation 

Notation. Throughout this paper, matrices are denoted by bold 
upper-case letters: A; vectors by bold lower-case letters: a; 
the zth column of a matrix A by: a^; and sets by calligraphic 
symbols: A. The standard vectorization operator is denoted 
by vec(-). The is denoted by H-Hp (it counts the 

number of non-zero entries), Ij-ljp, denotes the Frobenius 
norm, and 11-112 the spectral norm. By abuse of notations, 
||A||o = ||vec(A)|jo. The identity matrix is denoted Id. 


A. Objective 

The goal of this paper is to introduce a method to get 
a FA/rST associated to an operator of interest. Consider a 
linear operator corresponding to the matrix A G The 

objective is to find sparse factors G ^ j g {1... J} 

with ai = n and oj+i = m such that A « UU Su This 
naturally leads to an optimization problem of the form: 


Minimize 


j 

A-n^j 

Data fidelity 


J 


2 = 1 


Sparsity-inducing penalty 


( 2 ) 


to trade-off data fidelity and sparsity of the factors. 


B. Expected benefits of FAp,STs 

A multi-layer sparse approximation of an operator A brings 
several benefits, provided the relative complexity of the fac¬ 
torized form is small with respect to the dimensions of A. 
For the sake of conciseness, let us introduce Sj = ||S_,j|p 
the total amount of non-zero entries in the jth factor, and 
Stot = J2j=i *^he total number of non-zero entries in the 
whole factorization. 


Definition II.l. The Relative Complexity (abbreviated RC) is 
the ratio between the total number of non-zero entries in the 
FApST and the number of non-zero entries of A: 


RC:= 


Stot 


( 3 ) 


It is also interesting to introduce the Relative Complexity Gain 
(RCG), which is simply the inverse of the Relative Complexity 
(RCG= l/RC). 


The aforementioned condition for the factorized form to be 
beneficial writes: RC ^ 1 or equivalently RCG ^ 1. 

FA/rSTs reduce computational costs in all aspects of their 
manipulation, namely a lower Storage cost, a higher Speed of 
multiplication and an improved Statistical significance. 

1) Storage cost: Using the Coordinate list (COO) storage 
paradigm 1191, one can store a FA/iST using O(stot) floats and 
integers. Indeed each non-zero entry (float) in the factorization 
can be located using three integers (one for the factor, one for 
the row and one for the column), which makes Stot floats and 
3stot integers to store. One needs also J -f 1 supplementary 
integers to denote the size of the factors oi to aj+i. In 
summary the storage gain is of the order of RCG. 

2) Speed of multiplication: Applying the FA/iST or its 
transpose to a vector v g K" can be easily seen to require 
at most 0{stot) floating point operations (flops), instead of 
0{mn) for a classical dense operator of same dimension, so 
the computational gain is, like the storage gain, of the order 
of RCG. 

3) Statistical significance: Another interesting though less 
obvious benefit of FA/rSTs over dense operators arises when 
the operator has to be estimated from training data as in dictio¬ 
nary learning. In this case the reduced number of parameters to 
learn -O(stot) compared to 0{mn) for dense operators- leads 
to better statistical properties. More specifically, the sample 
complexity is reduced pO) , and better generalization proper¬ 
ties are expected. The sample complexity gain is of the order 
of RCG, as will be shown in the case of dictionary learning. 
The impact of these gains will be illustrated experimentally in 
section 


VI on image denoising with learned dictionaries. 


C. Related work 

Similar matrix factorization problems have been studied in 
several domains. Some are very classical tools from numerical 
linear algebra, such as the truncated SVD, while other emerged 
more recently in signal processing and machine learning. 
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1) The truncated SVD: To reduce the computational com¬ 
plexity of a linear operator, the most classical approach is 
perhaps to compute a low-rank approximations with the trun¬ 
cated SVD. Figure 1^ compares the approximation-complexity 
trade-offs achieved via a low-rank approximation (truncated 
SVD) and via a multi-layer sparse approximation, on a forward 
operator associated to an MEG inverse problem. The truncated 
SVD and four FA/iSTs computed using different conhgura- 
tions (more details in Section |V] - Figure are compared in 
terms of relative operator norm error; ||A — A|| 2 /||A|| 2 . Itis 
readily observed that the FA/iSTs achieve significantly better 
complexity/error trade-offs. 



Fig. 2. Comparison between four multi-layer sparse approximations corre- 
ponding to different configurations and the truncated SVD. A 204 X 8193 
matrix associated to an MEG inverse problem is used for this example. 


2 ) Local low-rank approximations: Given the limitations of 
global low-rank approximation by truncated SVD illustrated in 
Figure the numerical linear algebra community has devel¬ 
oped local approximations of operators by low-rank patches. 
This general operator compression paradigm encompasses 
several methods introduced in the last three decades, including 
the Fast Multipole Method (FMM) | [^ , H-matrices p2) and 
others | |2^ . All the difficulty of these methods resides in 
the choice of the patches, which is done according to the 
regularity of the subsumed continuous kernel. This can be 
seen as approximating the operator by a FA/iST, where the 
support of each factor is determined analytically. The approach 
proposed in this paper is data-driven rather than analytic. 

3) Wavelet-based compression: This operator compression 

paradigm, introduced in p4) , is based on the use of orthogonal 
wavelets in the column domain (associated to a matrix $i), 
and in the row domain (associated to a matrix # 2 ) of the 
matrix A to approximate. By orthogonality of the matrices 
$1 and $ 2 . we have A = $i$^A$ 2 ^^- The wavelet-based 
compression scheme relies on the fact that B = $^A $2 
is compressible (provided appropriate wavelets are chosen 
relative to the subsumed kernel). This implies B « B where 
B is sparse so that A « A = Fast multiplication by 

A is possible as soon as B is sparse enough and the wavelet 
transforms $1 and $2 have fast implementations. This can be 
seen as approximating A by a FA/iST. 

4) Dictionary learning: Given a collection of training vec¬ 
tors y^, 1 < ^ < T gathered as the columns of a matrix Y, the 
objective of dictionary learning p3| , |25| is to approximate 
Y by the product of a dictionary D and a coefficients matrix 
r with sparse columns, Y « DF. 

To learn dictionaries with improved computational effi¬ 
ciency, two main lines of work have begun to explore ap¬ 
proaches related to multi-layer sparse approximation. In p^. 


the authors propose the sparse-KSVD algorithm (KSVDS) to 
learn a dictionary whose atoms are sparse linear combinations 
of atoms of a so-called base dictionary Dbase- The base dic¬ 
tionary should be associated with a fast algorithm (in practice, 
this means that it is a FA/iST) so that the whole learned 
dictionary is itself a FA/iST. It can be seen as having the 
J— 1 leftmost factors hxed in (|^, their product being precisely 
Dbasej while the first factor Si is the sparse representation of 
the dictionary over the base dictionary, i.e., D = DbaseSi. 

A limitation of the sparse-KSVD formulation is that the 
learned dictionary is highly biased toward the base dictionary, 
which decreases adaptability to the training data. In |15|, the 
authors propose to learn a dictionary in which each atom is 
the composition of several circular convolutions using sparse 
kernels with known supports, so that the dictionary is a sparse 
operator that is fast to manipulate. This problem can be seen 
as with the penalties gjs associated to the J — 1 leftmost 
factors imposing sparse circulant matrices with prescribed 
supports. This formulation is powerful, as demonstrated in 
1 15 1 , but limited in nature to the case where the dictionary is 
well approximated by a product of sparse circulant matrices, 
and requires knowledge of the supports of the sparse factors. 


5 ) Inverse problems: In the context of sparse regularization 
of linear inverse problems, one is given a signal y and a 
measurement matrix M and wishes to compute a sparse 
code 7 such that y « M 7 , see e.g. |j2^. Most modern 
sparse solvers rely on some form of iterative thresholding and 
heavily rely on matrix-vector products with the measurement 
matrix and its transpose. Imposing -and adjusting- a FA/rST 
structure to approximate these matrices as proposed here 
has the potential to further accelerate these methods through 
fast matrix-vector multiplications. This is also likely to bring 
additional speedups to recent approaches accelerating iterative 
sparse solvers through learning p7|-p9|. 

6) Statistics - factor analysis: A related problem is to 
approximately diagonalize a covariance matrix by a unitary 
matrix in factorized form Q, which can be addressed greedily 
0 , using a hxed number of elementary Givens rotations. 
Here we consider a richer family of sparse factors and leverage 
recent non-convex optimization techniques. 


7) Machine learning: Similar models were explored with 
various points of view in machine learning. For example, 
sparse multi-factor NMF can be seen as solving prob¬ 
lem (|^ with the Kullback-Feibler divergence as data hdelity 
term and all factors SjS constrained to be non-negative. Opti¬ 
mization relies on multiplicative updates, while the approach 
proposed here relies on proximal iterations. 


8) Deep learning: In the context of deep neural networks, 
identihability guarantees on the network structure have been 
established with a generative model where consecutive net¬ 
work layers are sparsely connected at random, and non- 
linearities are neglected p^ , p4) . The network structure in 
these studies matches the factorized structure ([T]), with each 
of the leftmost factors representing a layer of the network and 
the last one being its input. Apart from its hierarchical flavor, 
the identification algorithm in p^ , p4| has little in common 
with the proximal method proposed here. 
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9) Signal processing on graphs: Similar matrix factoriza¬ 
tions problems arise in this domain, with the objective of 
dehning wavelets on graphs. First, in p5j the authors propose 
to approximately diagonalize part of the graph Laplacian 
operator using elementary rotations. More precisely, the basis 
in which the Laplacian is expressed is greedily changed, 
requiring that at each step the change is made by a sparse 
elementary rotation (so that the wavelet transform is multi¬ 
layer sparse), and variables are decoirelated. The Laplacian 
ends up being diagonal in all the dimensions corresponding to 
the wavelets and dense in a small part corresponding to the 
scaling function (the algorithm ends up being very similar 
to the one proposed in |3T)). Second, in the authors 
propose to dehne data adaptive wavelets by factorizing some 
training data matrix made of signals on the graph of interest. 
The constraint they impose to the wavelet operator results 
in a multi-layer sparse structure, where each sparse factor is 
further constrained to be a lifted wavelet building block. The 
optimization algorithm they propose relies on deep learning 
techniques, more precisely layer-wise training of stacked auto¬ 
encoders iz). 


HI. Optimization framework 
A. Objective function 

In this paper, the penalties gjf ) appearing in the general 
form of the optimization problem Q are chosen as indicator 
functions 5£^{-) of constraint sets of interest £j. To avoid 
the scaling ambiguities arising naturally when the constraint 
sets are (positively) homogeneou^ it is common fl^ , (32| 
to normalize the factors and introduce a multiplicative scalar 
A in the data hdelity term. For that, let us introduce the sets 


of normalized factors Afj = {S € 




= 1 }, 


and impose the following form for the constraints sets: £j = 
JVj n Sj, where Sj imposes sparsity explicitly or implicitly. 
This results in the following optimization problem: 


Minimize 

A,Si,...,S , 


Jk(Si,...,Sj,A) := 


i A-An S, 

+ E 

i=i 

(4) 

As will be made clear below, the used minimization al¬ 
gorithm relies on projections onto the constraint sets Sf. the 
choice of the “sparsity-inducing” part of the constraint sets Sj 
is quite free provided that the projection operator onto these 
sets is known. 

A comon choice is to limit the total number of non-zero 
entries in the factors to s,. The constraint sets then take 


the form £j = {S G 


paj + i Xaj 


lo < 


= !}■ 


Another natural choice is to limit to kj the number of non¬ 
zero entries per row or column in the factors, which gives for 
example in the case of the columns = {S € : 

||si||g < fcjVi, ||S||^ = 1}. Other possible constraint sets 
can be chosen to further impose non-negativity, a circulant 


structure, a prescribed support, etc., see for example 1151. 


^This is the case of many standard constraint sets. In particular all unions of 
subspaces, such as the sets of sparse or low-rank matrices, are homogeneous. 


Besides the few examples given above, many more choices 
of penalties beyond indicator functions of constraint sets can 
be envisioned in the algorithmic framework described below. 
Their choice is merely driven by the application of interest, 
as long as they are endowed with easy to compute projections 
onto the constraint sets (in fact, efficient proximal operators), 
and satisfy some technical assumptions (detailed below) that 
are very often met in practice. We leave the full exploration 
of this rich held and its possible applications to further work. 


B. Algorithm overview 

Problem Q is highly non-convex, and the sparsity-inducing 
penalties are typically non-smooth. Stemming on recent ad¬ 
vances in non-convex optimization, it is nevertheless possible 
to propose an algorithm with convergence guarantees to a 
stationary point of the problem. In 118|, the authors consider 
cost functions depending on N blocks of variables of the form: 


N 


$(xi, 


,XAr) :=7T(xi,...,XAr) +V/,(x^), (5) 


i=i 


where the function H is smooth, and the penalties /^s are 
proper and lower semi-continuous (the exact assumptions are 
given below). It is to be stressed that no convexity of any kind 
is assumed. Here, we assume for simplicity that the penalties 
fjS are indicator functions of constraint sets 7). To handle 
this objective function, the authors propose an algorithm called 
Proximal Alternating Linearized Minimization (PALM) p^ , 
that updates alternatively each block of variable by a proximal 
(or projected in our case) gradient step. The structure of the 
PALM algorithm is given in Figure ^ where PTj{') is the 
projection operator onto the set 7j and c* dehnes the step size 
and depends on the Lipschitz constant of the gradient of H. 


PALM (summary) 


I: 

for i G {1 • • • 

Niter} do 

2: 

for j G {1 

• • • A^} do 

3: 

Set x}+i 


4: 

end for 


5: 

end for 



Fig. 3. PALM algorithm (summary). 


The following conditions are sufficient (not necessary) to 
ensure that each bounded sequence generated by PALM con¬ 
verges to a stationary point of its objective ^ Theorem 3.1] 
(the sequence converges, which implies convergence of the 
value of the cost function): 

(i) The fjS are proper and lower semi-continuous. 

(ii) H is smooth. 

(iii) Ip is semi-algebraic i fTS} Dehnition 5.1]. 

(iv) is globally Lipschitz for all j, with Lipschitz 
moduli Lj(xi.. .Xj_i,x^+i.. .xjv). 

(v) Vi, j, Cj > . .x*t\,xj_,_i. . .x]v) (the inequality 

need not be strict for convex fj). 


C. Algorithm details 

PALM can be instantiated for the purpose of handling the 
objective of @. It is quite straightforward to see that there is 
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a match between and © by taking N — J + 1, Xj — Sj 
for j G {1... J}, xm +1 — X, H as the data fidelity term, 
fji-) = for j e {1... J} and /,/+i(-) = = 

<Je(’) = 0 (there is no constraint on A). This match allows to 
apply PALM to compute multi-layer sparse approximations, 
with guaranteed convergence to a stationary point. 

1) Projection operator: PALM relies on projections onto 
the constraint sets for each factor at each iteration, so the 
projection operator should be simple and easy to compute. 
For example, in the case where the £jS are sets of sparse nor¬ 
malized matrices, namely £j = {S € : ||vec(S)|jQ < 

Sj, ||S||^ = 1} for j G {1 ■ ■ ■ J}, then the projection operator 
PSj{') simply keeps the Sj greatest entries (in absolute value) 
of its argument, sets all the other entries to zero, and then 
normalizes its argument so that it has unit norm (the proof is 
given in Appendix |^. Regarding £j+i — R, the projection 
operator is the identity mapping. The projection operators for 
other forms of sparsity constraints that could be interesting in 
concrete applications are also given in Appendix [A] Proposi¬ 
tion 


A.l covers the following examples; 


• Global sparsity constraints. 

• Row or column sparsity constraints. 

• constrained support. 

• Triangular matrices constraints. 

• Diagonal matrices constraints. 


Proposition |A.2| covers in addition: 

• Circulant, Toeplitz or Hankel matrices with fixed support 
or prescribed sparsity. 

• Matrices that are constant by row or column. 

• More general classes of piece-wise constant matrices with 
possible sparsity constraints. 


2) Gradient and Lipschitz modulus: To specify the itera¬ 
tions of PALM specialized to the multi-layer sparse approxi¬ 
mation problem, let us fix the iteration i and the factor j, and 
denote S* the factor being updated, L := n/=j-i-i what is 
on its left and R := YleZi what is on its right (with the 
convention Yl£e 0 These notations give, when up¬ 
dating the jth factor S]; H{S\+\ S}t\, SJ,..., Sj, A^ = 

iF(L, S*, R, A*) = III A — A*LS*R|||.. The gradient of this 
smooth part of the objective with respect to the jth factor 
reads: 


Vs.R(L,S;,R, A^ = A*L'^(ATSiR-A)R^, 

which Lipschitz modulus with respect to 11S* 11 ^ is 
Lj(L,R, A®) = (A®)^ ||R|||. ||L||| (as shown in Appendix 
0 . Once all the J factors are updated, let us now turn 
to the update of A. Denoting A = n/=i brings: 

..., A®) = |||A — A®A|||,, and the gradient 

with respect to A® reads: 

Va.R(S®i+\ ..., S®/\ A®) = A®Tr(A'^A) - Tr(A'^A). 


3) Default initialization, and choice of the step size: Except 
when specified otherwise, the default initialization is with 
A° = 1, S° = 0, and Sj = Id for j > 2, with the 
convention that for rectangular matrices the identity has ones 
on the main diagonal and zeroes elsewhere. In practice the 


step size is chosen by taking c® = (1 -I- a).(A®)^ IIR-ll^ ■ ITH^ 
with a — 10“^. Such a determination of the step size is 
computationally costly, and alternatives could be considered 
in applications (a decreasing step size rule for example). 

4) Summary: An explicit version of the algorithm, called 
PALM for Multi-layer Sparse Approximation (palm4MSA), is 
given in Figure]^ in which the factors are updated alternatively 
by a projected gradient step (line 6) with a step-size controlled 
by the Lipschitz modulus of the gradient (line 5). We can solve 
for A directly at each iteration (line 9) because of the absence 
of constraint on it (thanks to the second part of the convergence 
condition (v) of PALM). 


PALM for Multi-layer Sparse Approximation (palm4MSA) 


Input; Operator A; desired number of factors J; constraint 
sets £j, j G {1 ■.. J}; initialization A°; stop¬ 

ping criterion (e.g., number of iterations N). 
for J = 0 to TV — 1 do 


for j = 1 to J do 

T ^ n/=j+i 

Setc®. >(A®)2||R|1 


^ Pe, (s® - iAT^(ALS® R - A)R^ 


end for 

A ^ n;=i s® 


i+l 


\z-ri ^ Tr(A A) 

Tr(AT’A) 

10 : end for 

Output: The estimated factorization; 

A^,{Sf}/^i = palm4MSA(A, J, ...) 


Fig. 4. PALM algorithm for multi-layer sparse approximation. 


IV. Hierarchical factorization 

The algorithm presented in Figure factorizes an input 
matrix corresponding to an operator of interest into J sparse 
factors and converges to a stationary point of the problem 
stated in Q. In practice, one is only interested in the stationary 
points where the data fitting term of the cost function is small, 
however as for any generic non-convex optimization algorithm 
there is no general convergence guarantee to such a stationary 
point. This fact is illustrated by a very simple experiment 
where the algorithm palm4MSA is applied to an input operator 
A G R®®^®® with a known sparse form A = U^i as 

the Hadamard transform (in that case N = log 2 n). The naive 
approach consists in setting directly J = N in palm4MSA, 
and setting the constraints so as to reflect the actual sparsity of 
the true factors (as depicted in Figure [T]l. This simple strategy 
performs quite poorly in practice for most initializations, and 
the attained local minimum is very often not satisfactory (the 
data fidelity part of the objective function is large). 


A. Parallel with deep learning 

Similar issues were faced in the neural network community, 
where it was found difficult to optimize the weights of neural 
networks comprising many hidden layers (called deep neural 
networks, see 1381 for a survey on the topic). Until recently, 
deep networks were often neglected in favor of shallower 
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architectures. However in the last decade, it was proposed | 
to optimize the network not as one big block, but one layer at 
a time, and then globally optimizing the whole network using 
gradient descent. This heuristic was shown experimentally to 
work well on various tasks |37|. More precisely, what was 
proposed is to perform first a pre-training of the layers (each 
being fed the features produced by the one just below, and 
the lowermost being fed the data), in order to initialize the 
weights in a good region to perform then a global fine tuning 
of all layers by simple gradient descent. 


B. Proposed hierarchical algorithm 

We noticed experimentally that taking fewer factors (J 
small) and allowing more non-zero entries per factor led to 
better approximations. This suggested to adopt a hierarchical 
strategy reminiscent of pre-training of deep networks, in order 
to iteratively compute only factorization with 2 factors. Indeed, 
when A = Jl^i '^he product of N sparse factors, it is 
also the product A = TiSi of 2 factors with Ti = 11^2 
so that Si is sparser than Ti. 

1) Optimization strategy: The proposed hierarchical strat¬ 
egy consists in iteratively factorizing the input matrix A into 2 
factors, one being sparse (corresponding to Si), and the other 
less sparse (corresponding to Ti). The process is repeated on 
the less sparse factor Ti until the desired number J of factors 
is attained. At each step, a global optimization of all the factors 
introduced so far can be performed in order to fit the product 
to the original operator A. 

2) Choice of sparsity constraint: A natural question is 

that of how to tune the sparsity of the factors and residuals 
along the process. Denoting — rij^f+i Sj, a simple 
calculation shows that if we expect each Sj to have roughly 
0{k) non-zero entries per row, then Tg cannot have more 
than non-zero entries per row. This suggests to 

decrease exponentially the number of non-zero entries in Tg 
with i and to keep constant 0{k) the number of non-zero 
entries per row in Sj. This choice of the sparsity constraints 
is further studied with experiments in Section |V] 

3) Implementation details: The proposed hierarchical strat- 
eg}0 is summarized in the algorithm given in Figure where 
the constraint sets related to the two factors need to be 
specified for each step: Eg denotes the constraint set related 
to the left factor Tg, and Eg the one for the right factor at 
the Cth factorization. Roughly we can say that line 3 of the 
algorithm is here to yield complexity savings. Line 5 is here 
to improve data fidelity: this global optimization step with 
palm4MSA is initialized with the current values of Tg and 

The hierarchical strategy uses palm4MSA J — 1 
times with an increasing number of factors, and with a good 
initialization provided by the factorization in two factors. This 
makes its cost roughly J — 1 times greater than the cost of the 
basic palm4MSA with J factors. 


Hierarchical factorization _ 

Input: Operator A; desired number of factors J; constraint 
sets Eg and Eg, I {1... J — 1}. 

1 : To ^ A 

2 : for £ = 1 to J — 1 do 

3: Factorize the residual Tg-i into 2 factors: 

'^^{F 2 ,Fl} = palin4MSA(T£_i, 2, (Eg,Eg}, 

init=default) 

4: Tg ^ A'F 2 and ^ Fi 

5: Global optimization: 

A,{T^, = palm4MSA(A, £ -f 1, 

{^Eg, {fj}j_]^},init=current) 

6 : end for 

7: Sj ^ Tj_i 

Output: The estimated factorization: A,{Sj}j^]^. 

Fig. 5. Hierarchical factorization algorithm. 

In greedy layerwise training of deep neural networks, the 
factorizations in two (line 3) would correspond to the pre¬ 
training and the global optimization (line 5) to the fine tuning. 

Remark The hierarchical strategy can also be applied the 
other way around (starting from the left), just by transposing 
the input. We only present here the version that starts from 
the right because the induced notations are simpler. It is also 
worth being noted that stopping criteria other than the total 
number of factors can be set. For example, we could imagine 
to keep factorizing the residual until the approximation error 
at the global optimization step starts rising or exceeds some 
pre-dehned threshold. 


C. Illustration: reverse-engineering the Hadamard transform 

As a hrst illustration of the proposed approach, we tested 
the hierarchical factorization algorithm of Figure when A is 
the dense square matrix associated to the Hadamard transform 
in dimension n = 2 ^. The algorithm was run with J = N 
factors. Eg = {T & ||T||o < ||T||^ = 1}, and 

£:, = {SGK-xM|S||o< 2 n,|lS|l^ = l}. 

In stark contrast with the direct application of palm4MSA 
with J = A, an exact factorization is achieved. Indeed, the 
first step reached an exact factorization A = TiSi inde¬ 
pendently of the initialization. With the default initialization 
(Section |III-C3| ), the residual Ti was observed to be still 
exactly factorizable. All steps (£ > 1) indeed also yielded exact 
factorizations Tg_i = T^S^, provided the default initialization 
was used at each step £ G {1 ,..., J — 1 }. 

Figure illustrates the result of the proposed hierarchical 
strategy in dimension n = 32. The obtained factorization is 
exact and as good as the reference one (cf Figure [T]) in terms 
of complexity savings. The running time of the factorization 
algorithm is less than a second. Factorization of the Hadamard 
matrix in dimension up to n = 1024 showed identical 
behaviour, with running times 0{n^) up to ten minutes. 


toolbox implementing all the algorithms and experiments performed in 
this paper is available at http: / / f aust. gf orge . inr ia. f r 
All experiments were performed in Matlab on an laptop with an intel(R) 
core(TM) i7-3667U @ 2.00GHz (two cores). 


V. Accelerating inverse problems 

A natural application of FA/tSTs is linear inverse problems, 
where a high-dimensional vector 7 needs to be retrieved from 







IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING 


7 


Step 1: Step 2: Step 3: Step 4: 





Pp-’ 


X 


kn 


J 


Fig. 7. Factorization setting: each factor is represented with its total sparsity. 

other factors Sj, j G {2,J} were set square, with global 


x204 


lo — 


= !}■ 


Fig. 6. Hierarchical factorization of the Hadamard matrix of size 32 x 32. 
The matrix is iteratively factorized in 2 factors, until we have J = 5 factors, 
each having s = 64 non-zero entries. 

some observed data \ 


:! M 7 . As already evoked in Sec- 
II-B iterative proximal algorithms can be expected to be 


tion 

significantly sped up if M is well approximated with a FApST 
of low relative complexity, for example using the proposed 
hierarchical factorization algorithm applied to A = M. 

In practice, one needs to specify the total number of factors 
J and the constraint sets 5^, A preliminary study on 
synthetic data was carried out in our technical report ||Tj, 
showing that a flexible trade-off between relative complexity 
and adaptation to the input matrix can be achieved. Here we 
leverage the rule of thumb presented in Section [Tll-CI to deepen 
the investigation of this question for a matrix M arising in a 
real-world biomedical linear inverse problem. 

A. Factorization compromise: MEG operator 

In this experiment, we explore the use of FA/rST in the con¬ 
text of functional brain imaging using magnetoencephalogra¬ 
phy (MEG) and electroencephalography (EEG) signals. Source 
imaging with MEG and EEG delivers insights into the active 
brain at a millisecond time scale in a non-invasive way. To 
achieve this, one needs to solve the bioelectromagnetic inverse 
problem. It is a high dimensional ill-posed regression problem 
requiring proper regularization. As it is natural to assume that 
a limited set of brain foci are active during a cognitive task, 
sparse focal source configurations are commonly promoted 
using convex sparse priors lig, ig. The bottleneck in the 
optimization algorithms are the dot products with the forward 
matrix and its transpose. 

The objective of this experiment is to observe achievable 
trade-offs between relative complexity and accuracy. To this 
end, we consider an MEG gain matrix M S ]^204x8i93 
(m = 204 and n = 8193), computed using the MNE software 
p2) implementing a Boundary Element Method (BEM). In 
this setting, sensors and sources are not on a regular spatial 
grid. Note that in this configuration, one cannot easily rely 
on classical operator compression methods presented in sec¬ 
tions |II-C2| that rely on the analytic expression of the kernel 
underlying M, or on those presented in section II-C3 that rely 
on some regularity of the input and output domains to define 
wavelets. Hence, in order to observe the complexity/accuracy 
trade-offs, M was factorized into J sparse factors using the 
hierarchical factorization algorithm of Eigure 

1) Settings: The rightmost factor Si was of the size of M, 
but with /c-sparse columns, corresponding to the constraint 
set = {S e M204x8193j|g^|| < = 1 }. All 


sparsity s, i.e. = {S S 

The “residual” at each step T^, £ G {1,..., J— 1} was also 
set square, with global sparsity geometrically decreasing with 
£, controlled by two parameters p and P. This corresponds 
to the constraint set^ = {T e K204 x 204 ||rp|j^ < 

Pp^~^, ||T||^ = 1} for £ e {1,..., J — 1}. The controlling 
parameters are set to; 

• Number of factors: J G {2,, 10}. 

• Sparsity of the rightmost factor: k G 

{5,10,15,20,25,30}. 

• Sparsity of the other factors: s G {2m, 4m, 8m}. 

• Rate of decrease of the residual sparsity: p = 0.8. 

The parameter P controlling the global sparsity in the residual 
was found to have only limited influence, and was set to P = 
1.4 X w?. Other values for p were tested, leading to slightly 
different but qualitatively similar complexity/accuracy trade¬ 
offs not shown here. The factorization setting is summarized 
in Eigure [7] where the sparsity of each factor is explicitly 
given. 

2) Results: Eactorizations were computed for 127 parame¬ 
ter settings. The computation time for each factorization was 
around (J — 1) x 10 minutes. Eigure displays the trade¬ 
off between speed (the RCG measure ([^ and approximation 
error: 


RE := 


M-ARtiS, 


3 = 1 ‘-’JII 2 


|M||, 


( 6 ) 


of each obtained EA/iST. We observe that: 

• The overall relative complexity of the obtained fac¬ 
torization is essentially controlled by the parameter k. 
This seems natural, since k controls the sparsity of the 
rightmost factor which is way larger than the other ones. 

• The trade-off between complexity and approximation for 
a given k is mainly driven by the number of factors J: 
higher values of J lead to lower relative complexities, but 
a too large J leads to a higher relative error. Taking J = 2 
(black dots) never yields the best compromise, hence the 
relevance of truly multi-layer sparse approximations. 

• Eor a fixed k, one can distinguish nearby trade-off curves 
corresponding to different sparsity levels s of the inter¬ 
mediate factors. The parameter s actually controls the 
horizontal spacing between two consecutive points on the 
same curve: a higher s allows to take a higher J without 
increasing the error, but in turn leads to a higher relative 
complexity for a given number J of factors. 

In summary, one can distinguish as expected a trade-off 
between relative complexity and approximation error. The 

'^Compared to a preliminary version of this experiment |l7| where the 
residual was normalized columnwise at the first step, here it is normalized 
globally. This leads to slightly better results. 
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Fig. 8. Results of the factorization of an m X n = 204 X 8193 MEG matrix. The shape of the symbols denotes the number of factors J (•: J = 2, 
0:J = 3, *:J = 4, 0:7 = 5, lV:J = 6, X :J = 7, A:J = 8, 0:J = 9, >;J = 10), and the color the value of the parameter s. 


configuration exhibiting the lowest relative error for each value 
of k is highlighted on Figure]^ this gives M 25 , Mig, Mu, 
Mg, My, Mg, where the subscript indicates the achieved 
RCG (rounded to the closest integer). For example. Mg can 
multiply vectors with 6 times less flops than M (saving 84% 
of computation), and M 25 can multiply vectors with 25 times 
less flops than M (96% savings). These six matrices are those 
appearing on Figure to compare FA/rSTs to the truncated 
SVD. They will next be used to solve an inverse problem and 
compared to results obtained with M. 

Remark Slightly smaller approximation errors can be ob¬ 
tained by imposing a global sparsity constraint to the rightmost 
factor, i.e., = {S G K^otxsiqs j|s||^ < /tn, ||S|1^ = 1}. 

This is shown on Figure by the points linked by a dashed 
line to the six matrices outlined above. However, such a global 
sparsity constraint also allows the appearance of null columns 
in the approximations of M, which is undesirable for the 
application considered next. 

B. Source localization experiment 

We now assess the impact of replacing the MEG gain 
matrix M G K^otxsiqs ^ FApST approximation for brain 
source localization. For this synthetic experiment, two brain 
sources chosen located uniformly at random were activated 
with gaussian random weights, giving a 2-sparse vector 7 G 
K8193^ whose support encodes the localization of the sources. 
Observing y := M 7 , the objective is to estimate (the support 
of) 7 . The experiment then amounts to solving the inverse 
problem to get 7 from the measurements y = M 7 G 
using either M or a FApST M during the recovery process. 

Three recovery methods were tested; Orthogonal Matching 
Pursuit (OMP) [To) (choosing 2 atoms), -regularized least 
squares (Ijls) |43| and Iterative Hard Thresholding (IHT) 
They yielded qualitatively similar results, and for the 
sake of conciseness we present here only the results for OMP. 

The matrices used for the recovery are the actual matrix 
M and its FA/iST approximations M 25 , Mig, Mu, Mg, My 
and Mg. The expected computational gain of using a FApST 
instead of M is of the order of RCG, since the computational 
cost of OMP is dominated by products with M^. 


Three configurations were considered when generating the 
location of the sources, in term of distance d (in centimeters) 
between the sources. For each configuration, 500 vectors y = 
M 7 were generated, and OMP was run using each matrix. The 
distance between each actual source and the closest retrieved 
source was measured. Figure displays the statistics of this 
distance for all scenarios: 

• As expected, localization is better when the sources are 
more separated, independently of the choice of matrix. 

• Most importantly, the performance is almost as good 
when using FA/iSTs Mg, My, Mg and Mu than when 
using the actual matrix M, although the FAp,STs are 
way more computationally efficient (6 to 11 times less 
computations). For example, in the case of well separated 
sources {d > 8 ), the FA/rSTs allow to retrieve exactly 
the sought sources more than 75% of the time, which is 
almost as good as when using the actual matrix M. 

• The performance with the two other FA/iSTs Myg and 
M 25 is a bit poorer, but they are even more computa¬ 
tionally efficient matrix (16 and 25 times less computa¬ 
tions). For example, in the case of well separated sources 
{d > 8 ), they allow to retrieve exactly the sought sources 
more than 50% of the time. 

These observations confirm it is possible to slighlty trade-off 
localization performance for substantial computational gains, 
and that FApSTs can be used to speed up inverse problems 
without a large precision loss. 

VI. Learning east dictionaries 

Multi-layer sparse approximations of operators are par¬ 
ticularly suited for choosing efficient dictionaries for data 
processing tasks. 

A. Analytic vs. learned dictionaries 

Classically, there are two paths to choose a dictionary for 
sparse signal representations | [T3| . 

Historically, the only way to come up with a dictionary 
was to analyze mathematically the data and derive a “simple” 
formula to construct the dictionary. Dictionaries designed this 
way are called analytic dictionaries GD (e.g., associated 
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Fig. 9. Localization performance (distance between actual and retrieved source) obtained with various matrices, for different distances between actual sources. 


to Fourier, wavelets and Hadamard transforms). Due to the 
relative simplicity of analytic dictionaries, they usually have a 
known sparse form such as the Fast Fourier Transform (FFT) 
Q or the Discrete Wavelet Transform (DWT) Q. 

On the other hand, the development of modern computers 
allowed the surfacing of automatic methods that learn a 
dictionary directly from the data p4)-p6). Given some raw 
data Y S the principle of dictionary learning is to 

approximate Y by the product of a dictionary D G 
and a coefficient matrix F G with sparse columns: 

Y « DT. 

Such learned dictionaries are usually well adapted to the data 
at hand. However, being in general dense matrices with no 
apparent structure, they do not lead to fast algorithms and are 
costly to store. We typically have L ^ max{m, n) (for sample 
complexity reasons), which implies to be very careful about 
the computational efficiency of learning in that case. 

B. The best of both worlds 

Can one design dictionaries as well adapted to the data 
as learned dictionaries, while as fast to manipulate and as 
cheap to store as analytic ones? This question has begun to be 
explored recently H), HD, and actually amounts to learning 
of dictionary that are FApSTs. More precisely, given Y, the 
objective is to learn a dictionary being a FA/iST (as in Q): 

j 

D=ns, 

f=i 

This can be done by inserting a dictionary factorization step 
into the traditional structure of dictionary learning algorithms 
as illustrated on Figure A consequence is that the 
coefficients update can be sped up by exploiting the FApST 


— Dictionary update -— 
► Coefficients update — 


■^Dictionary factorization 


Fig. 10. Classical dictionary learning algorithm structure (in roman). In italic, 
the added dictionary factorization step, specific to the approach presented here. 


structure of the dictionary. The approach described below uses 
a batch method for dictionary update, but the approach is a 
priori also compatible with stochastic gradient descent in the 
dictionary update for even more efficiency. 

In practice we propose to slightly modify the hierarchical 
factorization algorithm of Figure The idea is to take a 
dictionary D learned on some training data Y (with any 
classical dictionary learning method, such as K-SVD pdj ) and 
to hierarchically factorize it, taking into account and jointly 
updating the coefficients matrix F. 

The resulting hierarchical factorization algorithm adapted to 
dictionary learning is given in Figure The only differences 
with the hierarchical factorization algorithm given previously 
is that the coefficients matrix is taken into account (but kept 
fixed) in the global optimization step, and that an update of 
the coefficients by sparse coding is added after this global 
optimization step, in order to keep the error with respect to 
the data matrix low. This sparse coding step can actually be 
done by any algorithm (OMP, IHT, ISTA...), denoted by the 
general sparseCoding algorithm in Figure 0 

Remark As noted in section [TV-B 3 [ the dictionary factoriza¬ 
tion is presented here starting from the right. It could as well 
be performed starting from the left. 


C. Image denoising experiment 

In order to illustrate the advantages of FA/iST dictionaries 
over classical dense ones, an image denoising experiment 
is performed here. The experimental scenario for this task 
follows a simplified dictionary based image denoising work- 
flow. First, L = 10000 patches of size 8x8 (dimension 
m = 64) are randomly picked from an input 512 x 512 
noisy image (with various noise levels, of variance cr G 
{10,15,20,30,50}), and a dictionary is learned on these 
patches. Then the learned dictionary is used to denoise the 
entire input image by computing the sparse representation of 
all its patches in the dictionary using OMP, allowing each 
patch to use 5 dictionary atoms. The image is reconstructed 
by averaging the overlapping patches. 

Experimental settings. Several configurations were tested. 
The number of atoms n was taken in (128, 256, 512}. Inspired 
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Hierarchical factorization for dictionary learning _ 

Input: Data matrix Y; Initial dictionary D and coefficients 
r (e.g., from K-SVD); desired number of factors J; 
constraint sets 8^ and f S {1... J — 1}. 

1 : To ^ D, To ^ r 
2 : for £ = 1 to J — 1 do 

3: Dictionary factorization: factorize the residual T^_i 

into 2 factors 

^^{F 2 ,Fl} = palm4MSA(T^_i, 2, {££,££}, 

init=default) 

T^ ^— A^F 2 and Sf 3 — 

4: Dictionary update: global optimization 

A,{T^,{Sj}^^^i,F£_i} = palm4MSA(Y, £ + 2, 
{££, {£jYj^-£,{T£_i}], init=current) 

5: Coefficients update: 

Ff = sparseCoding(Y, Tf Sj) 

6 : end for 
7: S,7 ^ Tj_i 

Output: The estimated factorization; A,{Sj}^£^j^. 

Fig. 11. Hierarchical factorization algorithm for dictionary learning. 


by usual fast transforms, a number of factors J close to the 
logarithm of the signal dimension m = 64 was chosen, here 
J = 4. The sizes of the factors were; Sj, ...,82 G 
Si G and F G The algorithm of Figure [IT] was 

used, with the initial dictionary learning being done by K-SVD 
and sparseCoding being OMP, allowing each patch to 
use 5 dictionary atoms. Regarding the constraint sets, we took 
them exactly like in section V-A taking s/m G {2, 3, 6 ,12}, 
p G {0.4,0.5,0.7,0.9}, P = 64^ and k = s/m. For each 
dictionary size n, this amounts to a total of sixteen different 
configurations leading to different relative complexity values. 
The stopping criterion for palm4MSA was a number of 
iterations Ni = 50. Note that the usage of OMP here and at the 
denoising stage is a bit abusive since the dictionary does not 
have unit-norm columns (the factors are normalized instead), 
but it was used anyway, resulting in a sort of weighted OMP, 
where some atoms have more weight than others. 

Baselines. The proposed method was compared to Dense 
Dictionary Learning (DDL). K-SVD is used here to perform 
DDL, but other algorithms have been tested (such as online 
dictionary learning p 6 |), leading to similar qualitative results. 
In order to assess the generalization performance and to be 
as close as possible to the matrix factorization framework 
studied theoretically in |20|, DDL is performed following the 
same denoising workflow than our method (dictionary learned 
on 10000 noisy patches used to denoise the whole image, 
allowing five atoms per patch). The implementation described 
in g7) was used, running 50 iterations (empirically sufficient 
to ensure convergence). 


Note that better denoising performance can be obtained 
by inserting dictionary learning into a more sophisticated 
denoising workflows, see e.g. pS) . State of the art denoising 
algorithms indeed often rely on clever averaging procedure 
called “aggregation”. Our purpose here is primarily to illustrate 
the potential of the proposed FA/iST structure for denoising. 
While such workflows are fully compatible with the FApST 


structure, we leave the implementation and careful benchmark¬ 
ing of the resulting denoising systems to future work. 

As a last baseline, we used the above denoising scheme 
with an overcomplete DCT of 128, 256 or 512 atoms. 
Results. The experiment is done on the standard image 
database taken from ||49t (12 standard grey 512 x 512 images). 


In Figure 12 are shown the results for three images; the one 
for which FA/rST dictionaries perform worst (“Mandrill”), the 
one for which they perform best (“WomanDarkHair”) and the 
typical behaviour (“Pirate”). Several comments are in order; 

• First of all, it is clear that with the considered simple 
denoising workflow, FA/rST dictionaries perform better 
than DDL at strong noise levels, namely cr = 30 
and tj = 50. This can be explained by the fact that 
when training patches are very noisy, DDL is prone to 
overfitting (learning the noise), whereas the structure of 
FA/rSTs seems to prevent it. On the other hand, for low 
noise levels, we pay the lack of adaptivity of FA/rSTs 
compared to DDL. Indeed, especially for very textured 
images (“Mandrill” typically), the dictionary must be very 
flexible in order to fit such complex training patches, so 
DDL performs better. FApST dictionaries also perform 
better than DCT dictionaries at high noise levels. 

• Second, it seems that sparser FA/rSTs (with fewer pa¬ 
rameters) perform better than denser ones for high noise 
levels. This can be explained by the fact that they are 
less prone to overfitting because of their lower number of 
parameters, implying fewer degrees of freedom. However, 
this is not true for low noise levels or with too few 
parameters, since in that case the loss of adaptivity with 
respect to the training patches is too important. 

D. Sample complexity of FApSTs 

The good performance of FA/rST dictionaries compared 
to dense ones observed above may be surprising, since the 
more constrained structure of such dictionaries (compared 
to dense dictionaries) may bar them from providing good 
approximations of the considered patches. A possible element 
of explanation stems from the notion of sample complexity; 


as evoked in section II-B the statistical significance of learned 
multi-layer sparse operators is expected to be improved com¬ 
pared to that of dense operators, thanks to a reduced sample 
complexity. 

In the context of dictionary learning, the sample complexity 
indicates how many training samples L should be taken in 
order for the empirical risk to be (with high probability) 
uniformly close to its expectation |50|, m- In | [20) , a general 
bound on the deviation between the empirical risk and its 
expectation is provided, which is proportional to the covering 
dimension of the dictionary class. 

For dense dictionaries the covering dimension is known 
to be 0{mn) p()| , pT) . For FA/iST dictionaries we 

establish in Appendix |Q the following theorem. 

Theorem VI.l. For multi-layer sparse operators, the covering 
dimension is bounded by Stot- 

A consequence is that for the same number of training 
samples L, a better generalization performance is expected 
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Fig. 12. Denoising results. The relative performance of FA/iST dictionaries 
compared to DDL, and the DCT compared to DDL is given for several noise 
levels <j, for various values of Stot on the horizontal axis. 


from FA/rST dictionaries compared to dense ones, the gain 
being of the order of RCG. This fact is likely to explain the 
empirical success of FA/rSTs compared to dense dictionaries 
observed in section [VI-C| at low SNR. Of course, when Stot be¬ 
comes to small, the limited approximation capacity of FApST 
dictionaries imposes a trade-off between approximation and 
generalization. 


VII. Conclusion and future work 

In this paper, a novel multi-layer matrix factorization frame¬ 
work was introduced, which allows to approximate a given 
linear operator by the composition of several ones. The 
underlying factorization algorithm stems on recent advances 
in non-convex optimization and has convergence guarantees. 
The proposed approach consists in hierarchically factorizing 
the input matrix in the hope of attaining better local minima, 
as is done for example in deep learning. The factorization 
algorithm that is used is pretty general and is able to take 
into account various constraints. One practical constraint of 
interest is sparsity (of various forms), which has several 
interesting properties. Indeed, multi-layer sparsely factorized 
linear operators have several advantages over classical dense 
ones, such as an increased speed of manipulation, a lighter 
storage footpring, and a higher statistical significance when 
estimated on training data. 

The interest and versatility of the proposed factorization ap¬ 
proach was demonstrated with various experiments, including 
a source localization one where the proposed method performs 
well with a greatly reduced computational cost compared 
to previous techniques. We performed also image denoising 


experiments demonstrating that the proposed method has bet¬ 
ter generalization performance than dense dictionary learning 
with an impact at low SNR. 

In the future, several developments are expected. On the 
theoretical side, we envision bounds on the trade-off between 
approximation quality and relative complexity. On the experi¬ 
mental side, new applications for FA/iSTs are to be explored. 
For example, signal processing on graphs is a relatively new 
discipline where computationally efficient operators can be 
envisioned using learned FA/rST dictionaries, or a FApST 
approximation of graph Fourier transforms. Moreover, the 
factorization cost being quite high, one could envision ways 
to approximate matrices by FA/rSTs without accessing the 
whole matrix, in order to reduce this cost. For example, one 
could imagine to have observations of the form (xi,yi = 
Axi) and try to minimize a data fitting term of the form 

Er=i i|yi-n/=iSjx,|i2. 

Appendix A 
Proiection operators 

In this appendix are given the projection operators onto 
several constraint sets of interest for practical applications. 


A. General sparsity constraint 

Sparsity is the most obvious constraint to put on the 
factors for operator sparse approximations. Consider first the 
following general sparsity constraint set: 

£■ := {S G : ||S«J|o < s,V^ G {1,. .., A}, ||S||^ = 1}, 

where {Hi, ..., Hk} forms a partition of the index set, Si G 
N, Vi G {1,..., A}, and S 7 - is the matrix whose entries match 
those of S on T and are set to zero elsewhere. Given some 
matrix U G we wish to compute its projection onto the 

set £: PsiV) G argmin{||S - U||5, : S G £}. 
s 

Proposition A.l. Projection operator formula. 


Ps{V) = 


U7 


llUill^ 

where T is the index set corresponding to the union of the Si 
entries of XJui with largest absolute value, Vi G {1,..., A}. 


Proof Let S be an element of £ and J its support. We have 
l|S - u||^ = 1 -f ||U||^ - 2(vec(Uj),vec(S)).For a given 
support, the matrix S G f maximizing (vec(Uj),vec(S)) 
is S = Uj/||Uj||^. For this matrix, (vec(Uj),vec(S)) = 

||Uj||^ = \IYh=i IIUjnwJlFwhich is maximized if 
corresponds to the Si entries with largest absolute value of U 
within Ai, Vi G {1,..., A}. □ 


B. Sparse and piecewise constant constraints 

Given A pairwise disjoint sets Ci indexing matrix entries, 
consider now the constraint set corresponding to unit norm 
matrices that are constant over each index set Ci, zero outside 
these sets, with no more than s non-zero areas. In other words: 
£, := {S G : 3a = ||a||o < s,Sc, = 5,Vi G 

A}, Sq^ = 0, and ||S||^ = 1}. 
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Define u := with Ui := J2{m,n)GC, '^rnn, and 

denote J C {1,..., if} the support of a. 


Proposition A.2. The projection o/U onto £c is obtained with 
J the collection of s indices i yielding the highest |ui|/\/jC7f, 
ai ■■= Uij if 'i ^ J’ Oi := 0 otherwise. 


Proof. Let S be an element of £c, and J C 
be the support of the associated a. We proceed as for the 
previous proposition and notice that (vec(U),vec(S)) = 
E*Gj^(''^‘^(Uc.),vec(S)) = By the 

changes of variable bi = and Vi = Ui/^J\Ci\ we get 

(uj,a) = (vji,b) with b := Given J, maximizing 

this scalar product under the constraint 1 = ||S||^ = |jb ||2 
yields b* := Vy/||vy|| 2 , and 


(vj.,b*) = 




Max¬ 


imizing over J is achieved by selecting the s entri es o f 
V := (v)Ei with largest absolute value (Proposition A.li. 
Going back to the original variables gives the result. □ 


pi{x,y) = \\x-y\\p. in D^pfac we get that the mapping $ 
is a contraction (by induction). We can conclude that; 

-.7 


ff (2?spfac) e) < 


ajUj+i 

Si 




wrt the Pi metric. Using (P < ^ and n! > y/2'Kn 
yields A/'(X>spfac, e) < ULi —/= 


j:ajaj+i{l + — 

DehningQ = ^ we have A/'(X>spfac, e) < 

with h = Yfi^iSj and C = maxC'j. We thus have (i(I?spfac) < 

h = ^3 ~ Stot' 
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Appendix B 
Lipschitz modulus 

To estimate the Lipschitz modulus of the gradient of the 
smooth part of the objective we write: 

IVs.iT(L, Si, R, W) - Vs.iT(L, S 2 , R, W))||^ 

= (W)2||L^L(Si-S2)RR^||^ 

< (W)2||R||^||L||^||Si-S2||^. 


Appendix C 
Covering dimension 


The covering number N{A, e) of a set A is the minimum 
number of balls of radius e needed to cover it. The precise 
dehnition of covering numbers is given in pO) . The upper- 
box counting dimension of the set, loosely referred to as the 
covering dimension in the text is d{A) = lime_,.o ■ 

We are interested in the covering dimension of the set of 
FApSTs Dspfac- We begin with the elementary sets £j = {A G 
j^ajxaj+i . ||_4 ||q < Sj,|jA||^ = 1}. These sets can be seen 
as sets of sparse normalized vectors of size aj x aj+i. This 
leads following p0| (with the Frobenius norm) to: 


Dehning M := £i x ... x £j and using |20 lemma 16] 
gives JV{M,e) < (wrt to the max 


metric over the index j using the Frobenius norm). Using 

J2'Li\\xj - VjWp < Jmax\\xj - yflp gives Af{M,e) < 

3 

Wj=i wrt to the metric dehned by 

P{x,y) = J2i=i \\xj - VjWp- Dehning the mapping; 


$ : fxl :— £i X ... X £j v T^spfac 

(Di, D2,..., Dj) Dj ... D2D1, 


where Dspfac is the set of FA/rSTs of interest, and using the 
distance measures p{x,y) = Yfj=i\\x 3 ~ DjW p in Ai and 
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